In this R Notebooks, I will explore the distribution of Zipcodes accross DMAs. When we map the DMAs based on the zipcodes, we get that a given DMA is widespread accross the US as shown in the map below. This is noise in the data. But what percentage is noise?
For a given zipcode, the DMA that it most frequently is associated with is the correct zipcode-DMA association. All the rest are outliers. Therefore, we will calculate the percentage of occurance of a given zipcode accross different DMAs to detect the outliers.
The table below shows sample of records of geo_zip and geo_dma with their count. These are the most frequent ones.
The table below shows the total instances by zipcode. It will help us to analyze what percentage of records of a given zipcode is found in a given DMA.
Now, let’s calculate percentage of a given zipcode records across a DMA.
Let’s see the distibution of the percentages.
We see two peaks near 0 and 100. Let’s use log scale to further understand the distribution.
[1] 32793
[1] 33089
We see that 296 zipcodes (33089-32793), which is about 0.9%, have less than 95% maximum percentage of occrance in a given DMA.
There are 1123285 outlier zipcodes out of a total of 451653467 records
This is 0.2487051 percent
correct_zip_dma = df %>% anti_join(outliers, by = c("geo_dma", "geo_zip")) %>%
select(geo_dma, geo_zip) %>% distinct()
in_correct_zip_dm = outliers %>% select(geo_dma, geo_zip) %>% distinct()
final_df = df %>% select(-geo_dma) %>% left_join(correct_zip_dma, by = "geo_zip")
write.csv(final_df, "clean_geo_zip_dma_mapping.csv", row.names = FALSE)
So, there are less than 0.25% of zipcodes outside of their DMA (outliers) assuming that those with maximum percentages for each geo-zip and geo-dma combination are correct. After replacing the incorrect ones with the correct zipcode-DMA mapping, we get the map below.